BUG: fix block fragmentation in DataFrame.astype with dict dtype#63461

Closed
Chiwendaiyue wants to merge 8 commits into pandas-dev:main from Chiwendaiyue:FIX_astype_block

Conversation

@Chiwendaiyue
Contributor

  • closes BUG: DataFrame.astype leave the dataframe extremely fragmented (one block per column) #63433
    This PR fixes the issue by consolidating blocks of the same dtype after a dict-based astype operation. The fix is minimal and I think it is safe. (The alternative I considered was to determine and apply the correct block partitioning during the initial conversion.)
    It adds a call to _consolidate_inplace() on the result's BlockManager when dtype is a dict.
    The consolidation is wrapped in a try-except block that emits a warning on failure, so it never breaks the core functionality of astype. A failure only emits the warning and does not raise, so the change is backward compatible.
    I tried the Reproducible Example from the issue and it worked well. If there is any problem, I'm happy to fix it.
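
For anyone who wants to inspect the block count around a dict-based astype, a minimal sketch follows. It is not the Reproducible Example from the issue. It assumes that the private attributes `_mgr` and `nblocks` are available in the installed pandas version; they are internals used here for inspection only, and the column names and dtypes are made up for illustration.

```python
import numpy as np
import pandas as pd

# Ten float64 columns built from one 2-D array (illustrative data only).
df = pd.DataFrame(np.zeros((4, 10)), columns=[f"c{i}" for i in range(10)])

# Dict-based astype: every column is given an explicit target dtype.
result = df.astype({col: "float32" for col in df.columns})

# Assumption: `_mgr` and `nblocks` are private pandas internals and may
# change between versions; they are read here only to count blocks.
blocks_before = df._mgr.nblocks
blocks_after = result._mgr.nblocks
print(f"blocks before astype: {blocks_before}, after astype: {blocks_after}")
```

If the count after astype equals the number of columns, the result is fragmented in the way the issue describes.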

@Chiwendaiyue
Contributor Author

I've implemented a fix that consolidates blocks only when the block count explodes (currently, when the number of blocks equals the number of columns). I'm unsure whether this threshold is optimal; it feels somewhat subjective. Could a maintainer suggest a better criterion, please? Thanks!

Member

@rhshadrach left a comment

Thanks for the PR!

Comment thread pandas/core/generic.py
Comment on lines +6530 to +6534
warnings.warn(
    f"astype block consolidation failed: {type(e).__name__}",
    UserWarning,
    stacklevel=2,
)
Member

In what case can this fail?

Comment thread pandas/core/generic.py
total_cols = len(self.columns)
# only when the number of blocks explode do this
if current_blocks == total_cols and total_cols > 5:
    mgr._consolidate_inplace()
Member

This is still creating a very fragmented DataFrame and then performing a copy. We would prefer not fragmenting the DataFrame at all in the first place (I think this should be possible).
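
One possible reading of this suggestion, as a rough sketch and not pandas code: group the columns by their target dtype first, then cast each group in one pass, so the result starts with one block per dtype instead of one block per column. The function name `plan_blocks` and the representation of a block as a list of column positions are hypothetical and do not correspond to pandas internals.

```python
from collections import defaultdict


def plan_blocks(columns, current_dtypes, dtype_map):
    """Group column positions by the dtype each column should end up with.

    Columns missing from ``dtype_map`` keep their current dtype.
    Returns {target_dtype: [column positions]}, i.e. one planned block
    per dtype rather than one block per column.
    """
    groups = defaultdict(list)
    for pos, col in enumerate(columns):
        target = dtype_map.get(col, current_dtypes[col])
        groups[target].append(pos)
    return dict(groups)


# Hypothetical frame layout: four columns, three of them being recast.
columns = ["a", "b", "c", "d"]
current = {"a": "int64", "b": "int64", "c": "float64", "d": "float64"}
plan = plan_blocks(columns, current, {"a": "float32", "b": "float32", "c": "float32"})
print(plan)  # {'float32': [0, 1, 2], 'float64': [3]}
```

With a plan like this, each group could be converted together and no consolidation copy would be needed afterwards.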

@mroeschke
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke closed this Jan 23, 2026

Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.astype leave the dataframe extremely fragmented (one block per column)

3 participants